IE582 - Homework 2

Zelina Genel - 2023802015

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
match_data = pd.read_csv("/Users/zelina/Desktop/IE582/HW2/match_data.csv")

After loading the match data, we drop rows where "suspended" or "stopped" is True.

In [3]:
filt_match_data_ss = match_data[~(match_data["suspended"] | match_data["stopped"])]
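As a small illustration of the filter above, the sketch below applies the same mask to a toy frame. The toy values are made up; only the column names "suspended" and "stopped" come from the data, and both columns are assumed to be boolean with no missing values.

```python
import pandas as pd

# Toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "fixture_id": [1, 1, 2, 2],
    "suspended": [False, True, False, False],
    "stopped": [False, False, True, False],
})

# Keep rows where neither flag is set (assumes boolean columns with no NaN).
keep_mask = ~(toy["suspended"] | toy["stopped"])
toy_filtered = toy[keep_mask]
print(toy_filtered)
```

Only the rows where both flags are False survive, which is the behavior the notebook relies on.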

Since we are asked to perform tasks for each half, we divide the filtered dataset into two, separating 1st half and 2nd half data.

In [4]:
first_half_df = filt_match_data_ss[filt_match_data_ss["halftime"] == "1st-half"]
second_half_df = filt_match_data_ss[filt_match_data_ss["halftime"] == "2nd-half"]

Before moving forward, we check for missing odds values in the filtered data.

In [5]:
missing_values = first_half_df[["1", "2", "X"]].isna().sum()
print("Missing values 1, X, 2:")
print(missing_values)

missing_values = second_half_df[["1", "2", "X"]].isna().sum()
print("Missing values 1, X, 2:")
print(missing_values)
Missing values 1, X, 2:
1    0
2    0
X    0
dtype: int64
Missing values 1, X, 2:
1    0
2    0
X    0
dtype: int64

There is no missing data in the odds columns.

In [6]:
# number of games 
unique_ngame = filt_match_data_ss["fixture_id"].nunique()
print(unique_ngame)
648

Task 1

1. Calculate probabilities as inverse of odds values

Calculating probabilities as the inverse of the odds for the 1st half;

In [7]:
# probabilities based on odd

first_half_df["p_home_win"] = 1/first_half_df["1"]
first_half_df["p_away_win"] = 1/first_half_df["2"]
first_half_df["p_tie"] = 1/first_half_df["X"]
/var/folders/s4/ysq3xm291z73j3nbwgb_d8y00000gp/T/ipykernel_4821/1391180215.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  first_half_df["p_home_win"] = 1/first_half_df["1"]

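The SettingWithCopyWarning above appears because first_half_df was created by boolean indexing and then has new columns assigned to it. One common way to avoid the warning, sketched here on a toy frame rather than the real data, is to take an explicit .copy() of the slice before adding columns. The toy values and the name first_half_toy are assumptions for this illustration.

```python
import pandas as pd

# Toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "halftime": ["1st-half", "2nd-half", "1st-half"],
    "1": [2.0, 4.0, 1.5],
})

# Explicit copy: the new frame owns its data, so adding columns is unambiguous.
first_half_toy = toy[toy["halftime"] == "1st-half"].copy()
first_half_toy["p_home_win"] = 1 / first_half_toy["1"]

print(first_half_toy)
print("p_home_win" in toy.columns)  # checks whether the original frame was modified
```

The computed values are the same either way; the copy only makes the intent explicit and keeps the output free of warnings.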
Calculating probabilities as the inverse of the odds for the 2nd half;

In [8]:
second_half_df["p_home_win"] = 1/second_half_df["1"]
second_half_df["p_away_win"] = 1/second_half_df["2"]
second_half_df["p_tie"] = 1/second_half_df["X"]

2. Calculate normalized probabilities

Calculating normalized probabilities for 1st half;

In [9]:
# normalized probabilities
first_half_df["p_total"] = first_half_df["p_home_win"] + first_half_df["p_tie"] + first_half_df["p_away_win"]
first_half_df["p_home_win_norm"] = first_half_df["p_home_win"]/first_half_df["p_total"]
first_half_df["p_away_win_norm"] = first_half_df["p_away_win"]/first_half_df["p_total"]
first_half_df["p_tie_norm"] = first_half_df["p_tie"]/first_half_df["p_total"]

Calculating normalized probabilities for 2nd half;

In [13]:
second_half_df["p_total"] = second_half_df["p_home_win"] + second_half_df["p_tie"] + second_half_df["p_away_win"]
second_half_df["p_home_win_norm"] = second_half_df["p_home_win"]/second_half_df["p_total"]
second_half_df["p_away_win_norm"] = second_half_df["p_away_win"]/second_half_df["p_total"]
second_half_df["p_tie_norm"] = second_half_df["p_tie"]/second_half_df["p_total"]

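To make the two steps concrete, here is a small worked sketch with made-up odds (2.0, 3.5 and 4.0 are not taken from the data). The inverse odds usually sum to more than 1 because of the bookmaker margin, and dividing by that sum rescales them into a proper probability distribution.

```python
# Made-up decimal odds for home win, draw, away win (illustrative only).
odds = {"1": 2.0, "X": 3.5, "2": 4.0}

# Step 1: raw implied probabilities as inverse odds.
raw = {k: 1 / v for k, v in odds.items()}
total = sum(raw.values())  # exceeds 1 when the odds include a margin

# Step 2: normalize so the three probabilities sum to 1.
norm = {k: p / total for k, p in raw.items()}

print(f"sum of raw probabilities: {total:.4f}")
print({k: round(p, 4) for k, p in norm.items()})
```

This mirrors the p_total and *_norm columns computed above, just for a single row.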
3. Plot p(home_win)-p(away_win) vs p(tie)

Calculate the difference between P(home win) and P(away win) for 1st half

In [24]:
first_half_df["p_home_minus_away"] = first_half_df["p_home_win"] - first_half_df["p_away_win"]

Plot P(home win) - P(away win) on x-axis and P(tie) on y-axis for 1st half

In [26]:
plt.figure(figsize=(6, 4))
plt.scatter(first_half_df["p_home_minus_away"], first_half_df["p_tie"], alpha=0.7, color='blue', edgecolor='k')
plt.axhline(0, color='gray', linestyle='--', linewidth=0.8)  
plt.axvline(0, color='gray', linestyle='--', linewidth=0.8)  

plt.xlabel("P(home win) - P(away win)")
plt.ylabel("P(tie)")
plt.title("1st Half: P(home win) - P(away win) vs. P(tie)")

plt.grid(alpha=0.3)
plt.show()
[Figure: 1st half scatter plot of P(home win) - P(away win) vs. P(tie)]

Calculate the difference between P(home win) and P(away win) for 2nd half;

In [27]:
second_half_df["p_home_minus_away"] = second_half_df["p_home_win"] - second_half_df["p_away_win"]

Plot P(home win) - P(away win) on x-axis and P(tie) on y-axis for 2nd half

In [28]:
plt.figure(figsize=(6, 4))
plt.scatter(second_half_df["p_home_minus_away"], second_half_df["p_tie"], alpha=0.7, color='blue', edgecolor='k')
plt.axhline(0, color='gray', linestyle='--', linewidth=0.8)  
plt.axvline(0, color='gray', linestyle='--', linewidth=0.8)  

plt.xlabel("P(home win) - P(away win)")
plt.ylabel("P(tie)")
plt.title("2nd Half: P(home win) - P(away win) vs. P(tie)")

plt.grid(alpha=0.3)
plt.show()
[Figure: 2nd half scatter plot of P(home win) - P(away win) vs. P(tie)]

Defining bins from 0 to 1 with a step of 0.05 to count the observations that fall within each bin;

In [29]:
bins = np.arange(0, 1.05, 0.05)  # Include 1.0 in the range
bin_labels = [f"({bins[i]:.2f}, {bins[i+1]:.2f}]" for i in range(len(bins) - 1)]
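As a quick check of how these bins behave, the sketch below cuts a few made-up probabilities with the same edges and labels. The values in p are invented; the point is that intervals are closed on the right and that include_lowest=True is what lets an exact 0.0 fall into the first bin.

```python
import numpy as np
import pandas as pd

# Same edges and labels as in the cell above.
bins = np.arange(0, 1.05, 0.05)
bin_labels = [f"({bins[i]:.2f}, {bins[i+1]:.2f}]" for i in range(len(bins) - 1)]

# Made-up probabilities; 0.0 is only captured because include_lowest=True.
p = pd.Series([0.0, 0.05, 0.07, 0.33])
binned = pd.cut(p, bins=bins, labels=bin_labels, include_lowest=True)
print(binned.astype(str).tolist())
```

A value sitting exactly on an edge, such as 0.05, is assigned to the bin that ends at that edge.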

Categorize the probabilities into bins for 1st half;

In [30]:
first_half_df["bin"] = pd.cut(first_half_df["p_tie_norm"], bins=bins, labels=bin_labels, include_lowest=True)

Group by bin and result to count the outcomes in each bin;

In [31]:
bin_result_counts = first_half_df.groupby(["bin", "result"], observed=False).size().unstack(fill_value=0)

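Calculating the total number of observations in each bin and normalizing the count of each outcome by the bin total, to see how many observations fall into each bin and what fraction of them end as home win, draw or away win for the 1st half. (Each row of the data is one in-game observation, so these totals count rows, not distinct games.)

In [32]: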
bin_result_counts["total"] = bin_result_counts.sum(axis=1)
bin_result_counts["draw_fraction"] = bin_result_counts["X"] / bin_result_counts["total"]
bin_result_counts["home_win_fraction"] = bin_result_counts["1"] / bin_result_counts["total"]
bin_result_counts["away_win_fraction"] = bin_result_counts["2"] / bin_result_counts["total"]
In [33]:
# View the result
print(bin_result_counts)
result           1     2     X  total  draw_fraction  home_win_fraction  \
bin                                                                       
(0.00, 0.05]   409   207    53    669       0.079223           0.611360   
(0.05, 0.10]   855   330    75   1260       0.059524           0.678571   
(0.10, 0.15]  1239   575   195   2009       0.097063           0.616725   
(0.15, 0.20]  1750   943   503   3196       0.157384           0.547559   
(0.20, 0.25]  2127  1547  1256   4930       0.254767           0.431440   
(0.25, 0.30]  3460  2839  2779   9078       0.306125           0.381141   
(0.30, 0.35]  2153  1754  1951   5858       0.333049           0.367532   
(0.35, 0.40]   661   503   657   1821       0.360791           0.362987   
(0.40, 0.45]    89    85   111    285       0.389474           0.312281   
(0.45, 0.50]     0     4    37     41       0.902439           0.000000   
(0.50, 0.55]     0     0     1      1       1.000000           0.000000   
(0.55, 0.60]     0     0     0      0            NaN                NaN   
(0.60, 0.65]     0     0     0      0            NaN                NaN   
(0.65, 0.70]     0     0     0      0            NaN                NaN   
(0.70, 0.75]     0     0     0      0            NaN                NaN   
(0.75, 0.80]     0     0     0      0            NaN                NaN   
(0.80, 0.85]     0     0     0      0            NaN                NaN   
(0.85, 0.90]     0     0     0      0            NaN                NaN   
(0.90, 0.95]     0     0     0      0            NaN                NaN   
(0.95, 1.00]     0     0     0      0            NaN                NaN   

result        away_win_fraction  
bin                              
(0.00, 0.05]           0.309417  
(0.05, 0.10]           0.261905  
(0.10, 0.15]           0.286212  
(0.15, 0.20]           0.295056  
(0.20, 0.25]           0.313793  
(0.25, 0.30]           0.312734  
(0.30, 0.35]           0.299420  
(0.35, 0.40]           0.276222  
(0.40, 0.45]           0.298246  
(0.45, 0.50]           0.097561  
(0.50, 0.55]           0.000000  
(0.55, 0.60]                NaN  
(0.60, 0.65]                NaN  
(0.65, 0.70]                NaN  
(0.70, 0.75]                NaN  
(0.75, 0.80]                NaN  
(0.80, 0.85]                NaN  
(0.85, 0.90]                NaN  
(0.90, 0.95]                NaN  
(0.95, 1.00]                NaN  

Categorize the probabilities into bins for 2nd half;

In [34]:
second_half_df["bin"] = pd.cut(second_half_df["p_tie_norm"], bins=bins, labels=bin_labels, include_lowest=True)

Group by bins and result to count outcomes in each bin for 2nd half;

In [35]:
bin_result_counts_2 = second_half_df.groupby(["bin", "result"], observed=False).size().unstack(fill_value=0)

Calculating the total number of observations in each bin and normalizing the count of each outcome by the bin total, to see how many observations fall into each bin and what fraction of them end as home win, draw or away win for the 2nd half.

In [36]:
bin_result_counts_2["total"] = bin_result_counts_2.sum(axis=1)
bin_result_counts_2["draw_fraction"] = bin_result_counts_2["X"] / bin_result_counts_2["total"]
bin_result_counts_2["home_win_fraction"] = bin_result_counts_2["1"] / bin_result_counts_2["total"]
bin_result_counts_2["away_win_fraction"] = bin_result_counts_2["2"] / bin_result_counts_2["total"]
In [37]:
# View the result
print(bin_result_counts_2)
result           1     2    X  total  draw_fraction  home_win_fraction  \
bin                                                                      
(0.00, 0.05]  1926  1223   76   3225       0.023566           0.597209   
(0.05, 0.10]  1868  1313  230   3411       0.067429           0.547640   
(0.10, 0.15]  1603   831  365   2799       0.130404           0.572705   
(0.15, 0.20]  1313  1040  618   2971       0.208011           0.441939   
(0.20, 0.25]   982   700  572   2254       0.253771           0.435670   
(0.25, 0.30]   904   678  513   2095       0.244869           0.431504   
(0.30, 0.35]   562   344  398   1304       0.305215           0.430982   
(0.35, 0.40]   423   332  563   1318       0.427162           0.320941   
(0.40, 0.45]   523   378  702   1603       0.437929           0.326263   
(0.45, 0.50]   459   306  582   1347       0.432071           0.340757   
(0.50, 0.55]   286   253  630   1169       0.538922           0.244654   
(0.55, 0.60]   197   135  502    834       0.601918           0.236211   
(0.60, 0.65]   129    94  407    630       0.646032           0.204762   
(0.65, 0.70]    85    64  355    504       0.704365           0.168651   
(0.70, 0.75]    65    43  314    422       0.744076           0.154028   
(0.75, 0.80]    34    27  267    328       0.814024           0.103659   
(0.80, 0.85]    25    23  248    296       0.837838           0.084459   
(0.85, 0.90]    18    13  253    284       0.890845           0.063380   
(0.90, 0.95]     3     4  178    185       0.962162           0.016216   
(0.95, 1.00]     0     0    0      0            NaN                NaN   

result        away_win_fraction  
bin                              
(0.00, 0.05]           0.379225  
(0.05, 0.10]           0.384931  
(0.10, 0.15]           0.296892  
(0.15, 0.20]           0.350050  
(0.20, 0.25]           0.310559  
(0.25, 0.30]           0.323628  
(0.30, 0.35]           0.263804  
(0.35, 0.40]           0.251897  
(0.40, 0.45]           0.235808  
(0.45, 0.50]           0.227171  
(0.50, 0.55]           0.216424  
(0.55, 0.60]           0.161871  
(0.60, 0.65]           0.149206  
(0.65, 0.70]           0.126984  
(0.70, 0.75]           0.101896  
(0.75, 0.80]           0.082317  
(0.80, 0.85]           0.077703  
(0.85, 0.90]           0.045775  
(0.90, 0.95]           0.021622  
(0.95, 1.00]                NaN  

Define bins from -1 to 1 with a step of 0.2 for the 1st half, to compare with the p(home_win)-p(away_win) vs p(tie) plot;

In [43]:
bins_minus = np.arange(-1, 1.2, 0.2)  
bin_labels_minus = [f"({bins_minus[i]:.2f}, {bins_minus[i+1]:.2f}]" for i in range(len(bins_minus) - 1)]
# Categorize the probabilities into bins
first_half_df["bin_minus"] = pd.cut(first_half_df["p_home_minus_away"], bins=bins_minus, labels=bin_labels_minus, include_lowest=True)

# Group by bins and result to count outcomes in each bin
bin_result_counts_minus = first_half_df.groupby(["bin_minus", "result"], observed=False).size().unstack(fill_value=0)

# Add totals for each bin and normalize counts by bin total
bin_result_counts_minus["total"] = bin_result_counts_minus.sum(axis=1)
bin_result_counts_minus["draw_fraction"] = bin_result_counts_minus["X"] / bin_result_counts_minus["total"]
print(bin_result_counts_minus["draw_fraction"])
bin_minus
(-1.00, -0.80]    0.082907
(-0.80, -0.60]    0.165803
(-0.60, -0.40]    0.272727
(-0.40, -0.20]    0.274830
(-0.20, -0.00]    0.281860
(-0.00, 0.20]     0.360765
(0.20, 0.40]      0.353692
(0.40, 0.60]      0.270369
(0.60, 0.80]      0.141440
(0.80, 1.00]      0.088492
Name: draw_fraction, dtype: float64

We overlay the data points that represent the empirical draw fractions (the real probabilities of a draw in each bin) onto the p(home_win)-p(away_win) vs p(tie) plot for the 1st half;

In [41]:
overlay_x = [-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9]
overlay_y = [0.082907, 0.165803, 0.272727, 0.274830, 0.281860, 0.360765, 0.353692, 0.270369, 0.141440, 0.088492]

# Plot P(home win) - P(away win) on x-axis and P(tie) on y-axis
plt.figure(figsize=(6, 4))
plt.scatter(first_half_df["p_home_minus_away"], first_half_df["p_tie"], alpha=0.7, color='blue', edgecolor='k')

# Overlay additional data points
plt.scatter(overlay_x, overlay_y, alpha=1.0, color='red', edgecolor='black', s=100, label='Real Probabilities')

plt.axhline(0, color='gray', linestyle='--', linewidth=0.8)  
plt.axvline(0, color='gray', linestyle='--', linewidth=0.8)  

plt.xlabel("P(home win) - P(away win)")
plt.ylabel("P(tie)")
plt.title("1st Half: P(home win) - P(away win) vs. P(tie)")

plt.legend()
plt.grid(alpha=0.3)
plt.show()
[Figure: 1st half scatter plot of P(home win) - P(away win) vs. P(tie), with the empirical draw fractions overlaid in red]

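The overlay coordinates in the plotting cell are typed in by hand from the printed table. A hedged alternative is to derive them from the bin edges and the data, which avoids copy errors if the data change. The helper name draw_fraction_by_bin, the use of np.linspace for the edges, and the toy inputs are all assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def draw_fraction_by_bin(diff, result, edges):
    """Return bin midpoints and the fraction of rows with result 'X' per bin."""
    diff = pd.Series(diff)
    is_draw = pd.Series(result).eq("X")
    binned = pd.cut(diff, bins=edges, include_lowest=True)
    # Mean of a boolean series is the fraction of True values; empty bins give NaN.
    frac = is_draw.groupby(binned, observed=False).mean()
    midpoints = (edges[:-1] + edges[1:]) / 2
    return midpoints, frac.to_numpy()

# Hypothetical toy inputs, not taken from the match data.
edges = np.linspace(-1, 1, 11)  # ten bins of width 0.2
toy_diff = [-0.9, -0.85, 0.05, 0.1, 0.15, 0.95]
toy_result = ["1", "X", "X", "X", "2", "1"]

overlay_x_toy, overlay_y_toy = draw_fraction_by_bin(toy_diff, toy_result, edges)
print(np.round(overlay_x_toy, 2))
print(overlay_y_toy)
```

The two returned arrays could be passed straight to plt.scatter in place of the hand-typed lists.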
Repeating for 2nd half;

In [42]:
# Categorize the probabilities into bins
second_half_df["bin_minus"] = pd.cut(second_half_df["p_home_minus_away"], bins=bins_minus, labels=bin_labels_minus, include_lowest=True)

# Group by bins and result to count outcomes in each bin
bin_result_counts_minus_2 = second_half_df.groupby(["bin_minus", "result"], observed=False).size().unstack(fill_value=0)

# Add totals for each bin and normalize counts by bin total
bin_result_counts_minus_2["total"] = bin_result_counts_minus_2.sum(axis=1)
bin_result_counts_minus_2["draw_fraction"] = bin_result_counts_minus_2["X"] / bin_result_counts_minus_2["total"]

print(bin_result_counts_minus_2["draw_fraction"])
bin_minus
(-1.00, -0.80]    0.103505
(-0.80, -0.60]    0.230263
(-0.60, -0.40]    0.386364
(-0.40, -0.20]    0.453321
(-0.20, -0.00]    0.543131
(-0.00, 0.20]     0.601999
(0.20, 0.40]      0.414867
(0.40, 0.60]      0.223309
(0.60, 0.80]      0.243024
(0.80, 1.00]      0.083160
Name: draw_fraction, dtype: float64
In [45]:
overlay_x2 = [-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9]
overlay_y2 = [0.103505, 0.230263, 0.386364, 0.453321, 0.543131, 0.601999, 0.414867, 0.223309, 0.243024, 0.083160]

# Plot P(home win) - P(away win) on x-axis and P(tie) on y-axis
plt.figure(figsize=(6, 4))
plt.scatter(second_half_df["p_home_minus_away"], second_half_df["p_tie"], alpha=0.7, color='blue', edgecolor='k')

# Overlay additional data points
plt.scatter(overlay_x2, overlay_y2, alpha=1.0, color='red', edgecolor='black', s=100, label='Real Probabilities')

plt.axhline(0, color='gray', linestyle='--', linewidth=0.8)  
plt.axvline(0, color='gray', linestyle='--', linewidth=0.8)  

plt.xlabel("P(home win) - P(away win)")
plt.ylabel("P(tie)")
plt.title("2nd Half: P(home win) - P(away win) vs. P(tie)")

plt.legend()
plt.grid(alpha=0.3)
plt.show()
[Figure: 2nd half scatter plot of P(home win) - P(away win) vs. P(tie), with the empirical draw fractions overlaid in red]

Task 2

Divide the original (unfiltered) dataset into two halves once again;

In [46]:
first_half_df_2 = match_data[match_data["halftime"] == "1st-half"]
second_half_df_2 = match_data[match_data["halftime"] == "2nd-half"]

Identify games with a red card in the first 20 minutes of the 1st half (the filter below uses minute <= 20) and remove them;

In [47]:
red_card_games = first_half_df_2[(first_half_df_2["minute"] <= 20) & ((first_half_df_2["Redcards - away"] == 1) | (first_half_df_2["Redcards - home"] == 1))]["fixture_id"].unique()

# Remove these games from the dataset
first_half_df_2_remove_red = first_half_df_2[~first_half_df_2["fixture_id"].isin(red_card_games)].reset_index(drop=True)
In [48]:
num_removed_games_red = len(red_card_games)
print(num_removed_games_red)
6

Only 6 games are removed.
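The removal works at the fixture level: a single qualifying row is enough to drop every row of that game. The sketch below shows the same pattern on a toy frame. The toy values are invented, and the comparison uses >= 1 instead of == 1 so that rows with more than one card would also match; that change is my variation, not the notebook's code.

```python
import pandas as pd

# Toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "fixture_id": [10, 10, 11, 11, 12],
    "minute": [5, 30, 8, 40, 12],
    "Redcards - home": [0, 0, 1, 1, 0],
    "Redcards - away": [0, 1, 0, 0, 0],
})

cutoff = 20  # same threshold as the filter above
early_red = (toy["minute"] <= cutoff) & (
    (toy["Redcards - home"] >= 1) | (toy["Redcards - away"] >= 1)
)
flagged_ids = toy.loc[early_red, "fixture_id"].unique()

# Drop every row of a flagged fixture, not only the rows that triggered the flag.
kept = toy[~toy["fixture_id"].isin(flagged_ids)].reset_index(drop=True)
print(flagged_ids, kept["fixture_id"].unique())
```

Fixture 10 stays in the toy data because its red card happens after the cutoff, while fixture 11 is removed entirely.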

Identify games with a goal in the last ~5 minutes of the 2nd half (minute >= 40) and remove them;

In [49]:
goal_games = second_half_df_2[(second_half_df_2["minute"] >= 40) & ((second_half_df_2["Score Change - away"] == 1) 
                                                                    | (second_half_df_2["Score Change - home"] == 1))]["fixture_id"].unique()
# Remove these games from the dataset
second_half_df_2_remove_goal = second_half_df_2[~second_half_df_2["fixture_id"].isin(goal_games)].reset_index(drop=True)
In [50]:
num_removed_games_goal = len(goal_games)
print(num_removed_games_goal)
229

Removed 229 games.

Recalculate probabilities based on odds for 1st half and 2nd half after removal;

In [51]:
first_half_df_2_remove_red["p_home_win"] = 1/first_half_df_2_remove_red["1"]
first_half_df_2_remove_red["p_away_win"] = 1/first_half_df_2_remove_red["2"]
first_half_df_2_remove_red["p_tie"] = 1/first_half_df_2_remove_red["X"]
In [52]:
second_half_df_2_remove_goal["p_home_win"] = 1/second_half_df_2_remove_goal["1"]
second_half_df_2_remove_goal["p_away_win"] = 1/second_half_df_2_remove_goal["2"]
second_half_df_2_remove_goal["p_tie"] = 1/second_half_df_2_remove_goal["X"]

Calculate the difference between P(home win) and P(away win) for 1st half after removal;

In [53]:
first_half_df_2_remove_red["p_home_minus_away"] = first_half_df_2_remove_red["p_home_win"] - first_half_df_2_remove_red["p_away_win"]

Plot P(home win) - P(away win) on x-axis and P(tie) on y-axis and real probabilities of Draw for 1st half after removal;

In [64]:
overlay_x3 = [-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9]
overlay_y3 = [0.075881, 0.171190, 0.266027, 0.274446, 0.278935, 0.361219, 0.350442, 0.270294, 0.141387, 0.089564]
# Plot P(home win) - P(away win) on x-axis and P(tie) on y-axis for 1st half
plt.figure(figsize=(6, 4))
plt.scatter(first_half_df_2_remove_red["p_home_minus_away"], first_half_df_2_remove_red["p_tie"], alpha=0.7, color='blue', edgecolor='k')
plt.axhline(0, color='gray', linestyle='--', linewidth=0.8)  
plt.axvline(0, color='gray', linestyle='--', linewidth=0.8)  

plt.scatter(overlay_x3, overlay_y3, alpha=1.0, color='red', edgecolor='black', s=100, label='Real Probabilities')
# Add labels and title
plt.xlabel("P(home win) - P(away win)")
plt.ylabel("P(tie)")
plt.title("1st Half-removed: P(home win) - P(away win) vs. P(tie)")

# Show the plot
plt.legend()
plt.grid(alpha=0.3)
plt.show()
[Figure: 1st half after removal, scatter plot of P(home win) - P(away win) vs. P(tie), with the empirical draw fractions overlaid in red]

Calculate the difference between P(home win) and P(away win) for 2nd half after removal;

In [55]:
second_half_df_2_remove_goal["p_home_minus_away"] = second_half_df_2_remove_goal["p_home_win"] - second_half_df_2_remove_goal["p_away_win"]

Plot P(home win) - P(away win) on x-axis and P(tie) on y-axis and real probabilities of Draw for 2nd half after removal;

In [66]:
overlay_x4 = [-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9]
overlay_y4 = [0.011953, 0.149219, 0.337192, 0.530070, 0.646821, 0.740000, 0.596052, 0.284173, 0.138439, 0.026003]
# Plot P(home win) - P(away win) on x-axis and P(tie) on y-axis
plt.figure(figsize=(6, 4))
plt.scatter(second_half_df_2_remove_goal["p_home_minus_away"], second_half_df_2_remove_goal["p_tie"], alpha=0.7, color='blue', edgecolor='k')
plt.axhline(0, color='gray', linestyle='--', linewidth=0.8)  
plt.axvline(0, color='gray', linestyle='--', linewidth=0.8)  

plt.scatter(overlay_x4, overlay_y4, alpha=1.0, color='red', edgecolor='black', s=100, label='Real Probabilities')

plt.xlabel("P(home win) - P(away win)")
plt.ylabel("P(tie)")
plt.title("2nd Half-removed: P(home win) - P(away win) vs. P(tie)")

plt.legend()
plt.grid(alpha=0.3)
plt.show()
[Figure: 2nd half after removal, scatter plot of P(home win) - P(away win) vs. P(tie), with the empirical draw fractions overlaid in red]

Task 3

In [57]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
In [58]:
match_data = pd.read_csv("/Users/zelina/Desktop/IE582/HW2/match_data.csv")
first_half_df_3 = match_data[match_data["halftime"] == "1st-half"]

I selected some columns relevant to the outcome of the game to build the decision tree.

In [59]:
X = first_half_df_3[["Ball Possession % - home", "Ball Possession % - away", "Dangerous Attacks - home", "Dangerous Attacks - away",
                "Goal Attempts - home", "Goal Attempts - away", "Goals - home", "Goals - away", 
                "Penalties - home", "Penalties - away", "Redcards - home", "Redcards - away", 
                "Score Change - home", "Score Change - away", 
                "Successful Passes Percentage - home", "Successful Passes Percentage - away", 
                ]]  
y = first_half_df_3["result"]
In [60]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
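One caveat with a row-level split: each match contributes many rows, so rows from the same fixture_id can land in both the training and the test set. A possible alternative, sketched on toy data and not used in this notebook, is scikit-learn's GroupShuffleSplit, which keeps all rows of a fixture on one side of the split. The toy frame and the 0.4 test size are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 5 fixtures with 4 rows each (values are invented).
toy = pd.DataFrame({
    "fixture_id": np.repeat([1, 2, 3, 4, 5], 4),
    "feature": np.arange(20),
    "result": ["1", "X", "2", "1"] * 5,
})

# Split by fixture rather than by row.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
train_idx, test_idx = next(
    splitter.split(toy[["feature"]], toy["result"], groups=toy["fixture_id"])
)

train_ids = set(toy.loc[train_idx, "fixture_id"])
test_ids = set(toy.loc[test_idx, "fixture_id"])
print(train_ids, test_ids)
```

With a split like this, no fixture appears on both sides, so a test score would reflect games the tree has not seen.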
In [61]:
# Initialize the model
tree = DecisionTreeClassifier(max_depth=4, random_state=42) 

# Train the model
tree.fit(X_train, y_train)
Out[61]:
DecisionTreeClassifier(max_depth=4, random_state=42)

Visualize the tree. (Please note that this cell is markdown because the output of the code is not readable, so I pasted the tree image externally. The class names follow tree.classes_, which scikit-learn stores in sorted order: "1", "2", "X".)

plt.figure(figsize=(15, 10))
plot_tree(tree, feature_names=list(X.columns), class_names=list(tree.classes_), filled=True)
plt.show()

decision_tree.png
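A note on label order: scikit-learn stores the fitted class labels in sorted order in classes_, and plot_tree expects class_names in that same order. For the string labels used in the "result" column, that order is "1", "2", "X". The toy check below, on invented data, shows this.

```python
from sklearn.tree import DecisionTreeClassifier

# Invented toy data with the same three string labels as the "result" column.
X_toy = [[0], [1], [2], [3], [4], [5]]
y_toy = ["1", "X", "2", "1", "X", "2"]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_toy, y_toy)

# classes_ is sorted, so this is the order class_names must follow in plot_tree.
class_order = list(toy_tree.classes_)
print(class_order)
```

Passing the names in a different order would not change the fitted tree, only the labels printed in the plot.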

Making the tree for 2nd half;

In [68]:
second_half_df_3 = match_data[match_data["halftime"] == "2nd-half"]
X_2 = second_half_df_3[["Ball Possession % - home", "Ball Possession % - away", "Dangerous Attacks - home", "Dangerous Attacks - away",
                "Goal Attempts - home", "Goal Attempts - away", "Goals - home", "Goals - away", 
                "Penalties - home", "Penalties - away", "Redcards - home", "Redcards - away", 
                "Score Change - home", "Score Change - away", 
                "Successful Passes Percentage - home", "Successful Passes Percentage - away", 
                ]]  
y_2 = second_half_df_3["result"]
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_2, y_2, test_size=0.3, random_state=42)




# Initialize the model
tree_2 = DecisionTreeClassifier(max_depth=4, random_state=42) 

# Train the model
tree_2.fit(X_train_2, y_train_2)
Out[68]:
DecisionTreeClassifier(max_depth=4, random_state=42)

Visualize the tree (class names again follow the sorted order in tree_2.classes_)

plt.figure(figsize=(15, 10))
plot_tree(tree_2, feature_names=list(X_2.columns), class_names=list(tree_2.classes_), filled=True)
plt.show()

decision_tree_2.png

References

For this homework, I used ChatGPT for coding assistance and some brief insights on decision trees. I also took a look at a submitted homework just to understand some parts of the tasks, because the text of this assignment was very hard for me to understand.